ABSTRACT

Graph-based methods play an important role in unsupervised and
semi-supervised learning tasks by taking into account the underlying
geometry of the data set. In this paper, we consider a statistical
setting for semi-supervised learning and provide a formal justification
of the recently introduced framework of bandlimited interpolation
of graph signals. Our analysis leads to the interpretation that,
given enough labeled data, this method is very closely related to a
constrained low density separation problem as the number of data
points tends to infinity. We demonstrate the practical utility of our
results through simple experiments.

Index Terms: Graph signal processing, semi-supervised learning,
interpolation, asymptotics

1. INTRODUCTION 

Recently, graph-based methods have been employed very successfully
in solving the semi-supervised learning (SSL) problem [?].
The underlying approach involves constructing a geometric graph
from the data set, where the nodes correspond to data points and
the edge weights indicate similarities between them, generally computed
as a function of their distance in the feature space. These methods
are particularly attractive as they allow one to introduce priors
for smoothness, or local and global consistency in the data labels
(see, for example, the graph Laplacian regularizer $\mathbf{f}^T L \mathbf{f}$ and its
variations [?]).

An insightful way of justifying graph-based learning algorithms
is to study their behavior on statistical data in the large sample limit.
Several papers have analyzed the stochastic convergence of cuts on a
similarity graph constructed from data points sampled from a probability
distribution p(x). As the sample size goes to infinity and for a
specific graph construction scheme, the cut is shown to converge to a
weighted volume of the boundary: $\int_{\partial S} p^{\alpha}(s)\,ds$ for some $\alpha > 0$ that
depends on the graph definition [4]. These results serve as a justification
for spectral clustering, since searching for the minimum cut
on the similarity graph is equivalent to a low density separation problem
in the asymptotic limit. Similar arguments hold for SSL problems,
where the regularizer $\mathbf{f}^T L \mathbf{f}$ has been shown to converge to a
weighted energy expression of the form $\int \|\nabla f(x)\|^2 p^{\alpha}(x)\,dx$ [?].
Using this expression as a penalty ensures that the predicted labels
do not vary much in regions of high density.

More recently, SSL has also been viewed from a graph signal
processing perspective, where class indicator vectors are considered
as smooth signals defined on the similarity graph (see [?] for an
overview on graph signal processing). Specifically, in this setting,
one incorporates smoothness in the indicator vectors by approximating
them with bandlimited or lowpass signals with respect to the
graph's Fourier basis. The advantage of such an approach lies in the
fact that, by using the sampling theorem for graph signals [?], it is
possible to state conditions that guarantee perfect prediction of the
unknown labels. Then, the task of learning simply translates to one
of recovering a bandlimited graph signal from its known sample values [?].
We call this approach Bandlimited Interpolation of
Graph signals (BIG).

However, using BIG for SSL does not have a very clear theoretical
justification. Moreover, its connections with existing graph-based
methods in SSL are not fully understood. Specifically, one
needs to consider the following questions. First, how does the
interpolated class indicator signal compare to other indicator signals
satisfying the label constraints? Second, how does the bandwidth
of class indicator signals relate asymptotically to p(x) in the
statistical setting for SSL?

The focus of this work is to provide a formal justification for
BIG, and draw connections with existing methods. We answer the
first question using the graph sampling theorem: given enough labeled
data, the interpolated indicator signal has minimum bandwidth
among all indicator signals that satisfy the label constraints. We then
show in a statistical setting that an estimate of the bandwidth for any
indicator signal, on a specifically constructed graph, asymptotically
matches the supremum value of the probability distribution over the
corresponding decision boundary associated with the indicator, as
the number of data points, and thus the graph size, goes to infinity.
The two results put together suggest an interpretation for the BIG
approach in SSL problems: given enough labeled data, BIG learns a
decision boundary that respects the labels and over which the maximum
density of the data points is as low as possible, similar to other
graph-based methods. In summary, we observe from our result and
previous analyses of spectral clustering that asymptotically, there is
a strong link between the value of a cut and the bandwidth of its
associated indicator signal. Thus, the geometric properties desired
of "minimal cuts" in clustering translate to those of "minimal bandwidth"
indicator signals for classification in the presence of labels.

2. GRAPH-BASED LEARNING 

We now introduce the problem setting considered in this paper. 

Data Model: We assume that the data set consists of n random feature
vectors $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$ drawn independently from
some probability density function p(x) on $\mathbb{R}^d$. Let $\partial S$ be a smooth
hypersurface that splits $\mathbb{R}^d$ into two disjoint parts $S$ and $S^c$ (multi-class
problems can be modeled using the one-vs-all approach). Further,
let $\mathcal{X}_S = \mathcal{X} \cap S$ and $\mathcal{X}_{S^c} = \mathcal{X} \cap S^c$ be the sets of points that
land in $S$ and $S^c$ respectively. We denote the indicator vector for $\mathcal{X}_S$
by $\mathbf{1}_S \in \{0, 1\}^n$: $\mathbf{1}_S(i)$ equals 1 if $X_i \in \mathcal{X}_S$ and 0 otherwise.

Learning task: We consider the problem of semi-supervised learning,
where the labels of a small subset of data points $\mathcal{X}_L \subset \mathcal{X}$
are known and the task is to predict the labels of the unlabeled set
$\mathcal{X}_U = \mathcal{X} \setminus \mathcal{X}_L$. More precisely, we would like to obtain $\mathbf{1}_S(U)$ from
$\mathcal{X}$ and $\mathbf{1}_S(L)$, where $\mathbf{1}_S(U) \in \{0, 1\}^{|\mathcal{X}_U|}$ and $\mathbf{1}_S(L) \in \{0, 1\}^{|\mathcal{X}_L|}$
denote the membership, with respect to $\mathcal{X}_S$, of the unlabeled and
labeled sets of points respectively.

Graph model: We construct a distance-based similarity graph with
data points as nodes and edge weights given by the Gaussian kernel:

\[ w_{ij} = K_{\sigma^2}(X_i, X_j) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{\|X_i - X_j\|^2}{2\sigma^2} \right). \tag{1} \]

Further, we assume $w_{ii} = 0$, i.e., the graph does not have self-loops.
The adjacency matrix of the graph W is a symmetric matrix
with elements $w_{ij}$, while the degree matrix D is a diagonal matrix
with elements $D_{ii} = \sum_j w_{ij}$. We define the graph Laplacian as
$L = \frac{1}{n}(D - W)$. Normalization ensures the norm of L is stochastically
bounded as n increases.
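The graph model can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the kernel is the normalized Gaussian with variance parameter sigma^2 (consistent with the sup-norm bound and kernel properties used later in the paper), and the function names `gaussian_graph_laplacian` and `cut_value` are hypothetical.

```python
import numpy as np

def gaussian_graph_laplacian(X, sigma):
    """Return (W, L) for the Gaussian-kernel similarity graph on the rows of X.

    Assumption: K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) / (2 pi sigma^2)^(d/2).
    """
    n, d = X.shape
    # Pairwise squared Euclidean distances (clipped at 0 against round-off).
    sq = np.sum(X**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-dist2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)
    np.fill_diagonal(W, 0.0)          # no self-loops: w_ii = 0
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = (D - W) / n                   # normalized Laplacian L = (1/n)(D - W)
    return W, L

def cut_value(W, indicator):
    """Sum of edge weights between the nodes with indicator 1 and those with 0."""
    inside = np.asarray(indicator, dtype=bool)
    return W[inside][:, ~inside].sum()
```

With these definitions, the quadratic form `n * ind @ L @ ind` of a 0/1 indicator `ind` equals `cut_value(W, ind)`, which is the identity that links cuts to the Laplacian.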


2.1. Spectral Clustering on Graphs

Convergence of cuts has been studied before in the context of spectral
clustering, where one tries to minimize the graph cut across two
partitions of the nodes. Note that the empirical value of the graph
cut induced by the boundary $\partial S$ can be expressed in terms of the
indicator vector $\mathbf{1}_S$ for $S$ and the graph Laplacian as:

\[ \mathrm{Cut}(S, S^c) = \sum_{i \in S,\, j \in S^c} w_{ij} = \mathbf{1}_S^T (D - W)\,\mathbf{1}_S = n\,\mathbf{1}_S^T L\,\mathbf{1}_S. \tag{2} \]

It has been shown in [?] that the following convergence theorem
(stated in a simple form) holds for hyperplanes $\partial S$ in $\mathbb{R}^d$:

Theorem 1. Under the conditions $\sigma \to 0$ and $n\sigma^{d+1} \to \infty$,

\[ \frac{1}{n^2\sigma}\,\mathrm{Cut}(S, S^c) = \frac{1}{n\sigma}\,\mathbf{1}_S^T L\,\mathbf{1}_S \longrightarrow \frac{1}{\sqrt{2\pi}} \int_{\partial S} p^2(s)\,ds, \tag{3} \]

where ds ranges over all (d - 1)-dimensional volume elements tangent
to the hyperplane $\partial S$.

A similar result has been shown earlier for smooth hypersurfaces [?].
The condition $\sigma \to 0$ leads to a clear and well-defined limit on
the right hand side. Intuitively, it enforces sparsity in the similarity
matrix W by shrinking the neighborhood volume as the number of
data points increases. As a result, one can ensure that the graph
remains sparse even though the number of points goes to infinity.

The result above has significant implications for spectral clustering:
with a certain scaling, the empirical cut value converges to a
weighted volume of the boundary. Thus, spectral clustering is a means
of performing low density separation on a finite sample.

2.2. Graph Laplacian Regularization for SSL

In SSL, one generally exploits the availability of labeled samples to
reconstruct an unknown function f as follows:

\[ \text{Minimize } \mathbf{f}^T L\,\mathbf{f} \ \text{ such that } \ \mathbf{f}(L) = \mathbf{1}_S(L). \tag{4} \]

Note that $\mathbf{f}$ is generally not restricted to be an indicator and is taken
to be a smooth signal in $\mathbb{R}^n$. One particular convergence result in
this setting can be stated as follows [?]:

Theorem 2. Under the conditions $\sigma \to 0$ and $n\sigma^{d} \to \infty$,

\[ \frac{1}{n\sigma^2}\,\mathbf{f}^T L\,\mathbf{f} \longrightarrow C \int \|\nabla f(x)\|^2 p^2(x)\,dx, \tag{5} \]

where for each n, $\mathbf{f}$ is a vector representing the values of f(x) at the
n sample points and C is a constant factor independent of n and $\sigma$.

Similar to the justification of spectral clustering, this result justifies
the formulation in (4) for SSL: given label constraints, the predicted
signal must vary little in regions of high density.

2.3. Bandlimited Interpolation of Graph signals (BIG)

The task in BIG is to recover a bandlimited signal closest to the
indicator signal satisfying the label constraints. Let $\omega(\mathbf{f})$ denote the
bandwidth of a signal $\mathbf{f}$ and $PW_{\omega}(G)$ (Paley-Wiener space with
cutoff frequency $\omega$ [?]) denote the set of $\omega$-bandlimited signals on
the graph G, i.e., $PW_{\omega}(G) = \{\mathbf{f} \mid \omega(\mathbf{f}) \le \omega\}$. Then, the BIG
method essentially consists of:

1. Estimating the cut-off frequency $\omega_L$ associated with the labeled
set $\mathcal{X}_L$ using the sampling theorem for graph signals [9].

2. Estimating the desired indicator vector $\mathbf{1}_S$ from labels $\mathbf{1}_S(L)$ by
solving the following least-squares problem:

\[ \mathbf{f}_{LS} = \arg\min_{\mathbf{f}} \|\mathbf{f}(L) - \mathbf{1}_S(L)\|^2 \ \text{ s.t. } \ \mathbf{f} \in PW_{\omega_L}(G). \tag{6} \]

This method has been considered earlier [?], albeit with an arbitrary
choice of $\omega_L$. Note that if the original indicator $\mathbf{1}_S$ is bandlimited
with respect to the labeled set (i.e., $\omega(\mathbf{1}_S) < \omega_L$), then the
estimate $\mathbf{f}_{LS}$ in (6) is guaranteed to be equal to $\mathbf{1}_S$ as a consequence
of the sampling theorem. Moreover, in this case, $\mathbf{1}_S$ can also be
perfectly estimated by the solution of the following "dual" problem:

\[ \mathbf{f}_{\min} = \arg\min_{\mathbf{f}} \omega(\mathbf{f}) \ \text{ s.t. } \ \mathbf{f}(L) = \mathbf{1}_S(L). \tag{7} \]

These facts lead to the following insight regarding BIG for SSL:

Observation 1. If $\omega(\mathbf{1}_S) < \omega_L$, then

1. $\mathbf{1}_S$ can be perfectly recovered using either (6) or (7).

2. $\mathbf{1}_S$ is guaranteed to have minimum bandwidth among all indicator
vectors satisfying the label constraints $\mathbf{1}_S(L)$ on $\mathcal{X}_L$.

The observations above have significant implications: given enough
and appropriately chosen labeled data, BIG effectively recovers an
indicator vector with minimum bandwidth that respects the label
constraints. Note that by labeling enough data appropriately, we
mean to ensure that the cut-off frequency $\omega_L$ of the labeled set is
greater than the bandwidth $\omega(\mathbf{1}_S)$ of the indicator function of interest.
If this condition is not satisfied, both observations break down,
i.e., the solutions of (6) and (7) would be different and serve only
as approximations for $\mathbf{1}_S$. Moreover, the minimum bandwidth signal
$\mathbf{f}_{\min}$ satisfying the label constraints would differ from $\mathbf{1}_S$ and
may not even be an indicator vector. To help ensure that the condition
is satisfied, one can use efficient optimal algorithms for labeling [?].
We note that in practice, (6) can be solved via efficient
iterative techniques [?].
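The least-squares step of BIG can be sketched as follows. This is only an illustration under stated assumptions: the cut-off frequency is taken as a given input (estimating it from the labeled set via the sampling theorem is not shown), the Paley-Wiener space is represented by the Laplacian eigenvectors whose eigenvalues do not exceed the cut-off, and a dense eigendecomposition stands in for the iterative techniques mentioned above. The names `big_interpolate` and `predict_labels` are hypothetical.

```python
import numpy as np

def big_interpolate(L, labeled_idx, labels, cutoff):
    """Least-squares bandlimited fit to the known labels.

    L           : symmetric graph Laplacian (n x n)
    labeled_idx : indices of the labeled nodes
    labels      : known values on those nodes (e.g. 0/1 indicator entries)
    cutoff      : assumed cut-off frequency (supplied by the caller)
    """
    eigvals, U = np.linalg.eigh(L)            # graph Fourier basis
    U_band = U[:, eigvals <= cutoff]          # basis of the bandlimited subspace
    # Fit spectral coefficients using only the rows of the labeled nodes.
    coeffs, *_ = np.linalg.lstsq(U_band[labeled_idx], labels, rcond=None)
    return U_band @ coeffs                    # interpolated signal on all nodes

def predict_labels(f, threshold=0.5):
    """Turn the interpolated signal into 0/1 labels (thresholding is an assumption)."""
    return (np.asarray(f) >= threshold).astype(int)
```

When the true signal lies in the chosen subspace and the labeled rows of the basis have full column rank, the fit reproduces the signal exactly, which mirrors the perfect-recovery statement in Observation 1; otherwise the output is only an approximation.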

3. MAIN RESULT 

We now consider the convergence of the bandwidth $\omega(\mathbf{1}_S)$ of $\mathbf{1}_S$, as
the number of data points goes to infinity. To simplify our analysis,
we need certain assumptions: p(x) must be Lipschitz continuous
and twice differentiable on $\mathbb{R}^d$ and $\partial S$ must be smooth with radius
of curvature r > 0. Next, we note that the bandwidth of $\mathbf{1}_S$, with
respect to the Fourier basis specified by L, can be written as [9]

\[ \omega(\mathbf{1}_S) = \lim_{m \to \infty} \omega_m(\mathbf{1}_S), \tag{8} \]

where $\omega_m(\mathbf{1}_S)$ is the m-th order bandwidth estimate defined as:

\[ \omega_m(\mathbf{1}_S) = \left( \frac{\mathbf{1}_S^T L^m \mathbf{1}_S}{\mathbf{1}_S^T \mathbf{1}_S} \right)^{1/m}. \tag{9} \]
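The m-th order bandwidth estimate only needs repeated matrix-vector products with L, so it can be computed without an eigendecomposition. The sketch below assumes a dense NumPy Laplacian and a hypothetical function name `bandwidth_estimate`; for large m the intermediate products can become very small or very large, which a production implementation would need to rescale.

```python
import numpy as np

def bandwidth_estimate(L, f, m):
    """m-th order bandwidth estimate (f^T L^m f / f^T f)^(1/m) of a signal f."""
    f = np.asarray(f, dtype=float)
    g = f.copy()
    for _ in range(m):          # g = L^m f via m matrix-vector products
        g = L @ g
    return (f @ g / (f @ f)) ** (1.0 / m)
```

For an eigenvector of L the estimate equals its eigenvalue for every m. For a general signal the estimate is a weighted power mean of the eigenvalues, so it is non-decreasing in m and bounded above by the largest eigenvalue present in the signal.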


We now show that for the distance-based similarity graphs of (1), the
bandwidth estimate converges to a function of p(x), thus giving the
connection between the BIG approach and the low density separation
problem. Our result holds under the following set of conditions:

1. Large sample size: $n \to \infty$,

2. Shrinking neighborhood volume: $\sigma \to 0$,

3. Bandwidth estimate: $m \to \infty$, $m/n \to 0$, $m\sigma^2 \to 0$,

4. $(1/\sigma)^{1/m} \to 1$,

5. $(n\sigma^{md+1})/(mC^m) \to \infty$, where $C = 2/(2\pi)^{d/2}$.


Theorem 3. If conditions 1-5 hold, then

\[ \omega_m(\mathbf{1}_S) \xrightarrow{\ p.\ } \sup_{s \in \partial S} p(s), \tag{10} \]

where "p." denotes convergence in probability. Further, almost sure
convergence holds if condition 5 is replaced by
$(n\sigma^{md+1})/(mC^m \log n) \to \infty$.

Intuitively, the conditions 1-5 guarantee sparsity of the graph and
govern the scaling of the bandwidth estimate order. The theorem
essentially states that the estimate of the bandwidth of any indicator
vector converges to the supremum of the underlying probability
distribution on the corresponding decision boundary. We now specify a
graph construction scheme for which the result holds.

Corollary 1. Equation (10) holds if for each value of n, we choose
the parameters $\sigma$ and m as follows:

\[ \sigma = n^{-x/(md+1)}, \quad 0 < x < 1, \tag{11} \]

\[ m = (\log n)^{y}, \quad 1/2 < y < 1. \tag{12} \]
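The parameter choice in Corollary 1 is easy to evaluate numerically. The following sketch computes the kernel width and the estimate order for a given sample size and also reports the quantities that appear in conditions 3 to 5. The function names, the default exponents x = 0.5 and y = 0.75, and the decision to keep m real-valued (in practice it would be rounded to an integer) are assumptions made for illustration.

```python
import math

def corollary_parameters(n, d, x=0.5, y=0.75):
    """Return (sigma, m) with m = (log n)^y and sigma = n^(-x / (m d + 1))."""
    if not (0.0 < x < 1.0 and 0.5 < y < 1.0):
        raise ValueError("need 0 < x < 1 and 1/2 < y < 1")
    m = math.log(n) ** y
    sigma = n ** (-x / (m * d + 1))
    return sigma, m

def condition_diagnostics(n, d, sigma, m):
    """Quantities from conditions 3-5; condition 5 is reported on a log scale."""
    C = 2.0 / (2.0 * math.pi) ** (d / 2.0)
    return {
        "m_over_n": m / n,                               # should tend to 0
        "m_sigma_sq": m * sigma**2,                      # should tend to 0
        "inv_sigma_root": (1.0 / sigma) ** (1.0 / m),    # should tend to 1
        "log_cond5": (math.log(n) + (m * d + 1) * math.log(sigma)
                      - math.log(m) - m * math.log(C)),  # should tend to +inf
    }
```

Calling `condition_diagnostics(n, d, *corollary_parameters(n, d))` for increasing n gives a quick view of how fast each condition is approached; some of these quantities move toward their limits only very slowly in n.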


This result, along with the conclusions derived from the sampling
theorem for graph signals in the previous section, forms the basis of
justifying BIG as an effective method for SSL: given enough and
appropriately chosen labeled data, BIG learns the decision boundary
on which the supremum of the data density is minimum. Based
on this, the following conclusions become apparent:

1. BIG is a variant of the constrained low density separation problem
for a finite number of data points, similar to other methods.

2. To learn a boundary that passes through a region of high probability
density, more labeled data is required.

3.1. Proof sketch 

We now give an overview of the proof of Theorem 3. For our analysis,
we consider the quantity $Y_m$ defined for $m \in \mathbb{N}$ as:

\[ Y_m = \frac{\frac{1}{\sigma}\,\mathbf{1}_S^T L^m \mathbf{1}_S}{\mathbf{1}_S^T \mathbf{1}_S}. \tag{13} \]

We prove the following convergence result:

\[ (Y_m)^{1/m} \xrightarrow{\ p.\ } \left( E\{Y_m\} \right)^{1/m} \longrightarrow \sup_{s \in \partial S} p(s), \tag{14} \]

where the second arrow denotes sure (deterministic) convergence.
Since $(1/\sigma)^{1/m} \to 1$ (condition 4), we can reach the desired result
of (10) from (14) through a simple argument. Before providing a
sketch of the proof for (14), we first discuss how it relies on the
conditions in the Theorem's statement. Conditions 1 and 5 are required
to ensure stochastic convergence of the left hand side of (14).
Conditions 2 and 3 are required to show sure convergence of the
right hand side of (14). The proof of (14) begins by re-expressing
$Y_m$ as

\[ Y_m = \frac{\frac{1}{n\sigma}\,\mathbf{1}_S^T L^m \mathbf{1}_S}{\frac{1}{n}\,\mathbf{1}_S^T \mathbf{1}_S}, \]

and studying the convergence of the numerator
and denominator separately. By the strong law of large numbers, we
conclude that

\[ \frac{1}{n}\,\mathbf{1}_S^T \mathbf{1}_S \xrightarrow{\ a.s.\ } \int_S p(x)\,dx. \tag{15} \]

For the numerator, we decompose it into two parts: a variance term
for which we show stochastic convergence and a bias term for which
we prove deterministic convergence. Let $V = \frac{1}{n\sigma}\,\mathbf{1}_S^T L^m \mathbf{1}_S$; then
we have the following results for V and E{V}:

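Lemma 1 (Concentration). For every $\epsilon > 0$, we have:

\[ \Pr\left( |V - E\{V\}| > \epsilon \right) \le 2 \exp\left( \frac{ -\lfloor n/(m+1) \rfloor\, \sigma^{md+1} \epsilon^2 }{ 2 C^m E\{V\} + \frac{2}{3} \left| C^m - \sigma^{md+1} E\{V\} \right| \epsilon } \right), \tag{16} \]

where $C = 2/(2\pi)^{d/2}$. Note that the right hand side goes to 0 when
condition 5 holds.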


Proof sketch. We begin by expanding V as follows:

\[ V = \frac{1}{n^{m+1}\sigma}\,\mathbf{1}_S^T (D - W)^m \mathbf{1}_S \tag{17} \]

\[ \phantom{V} = \frac{1}{n^{m+1}} \sum_{i_1, \ldots, i_{m+1}} g(X_{i_1}, X_{i_2}, \ldots, X_{i_{m+1}}). \tag{18} \]

The above expansion has the form of a V-statistic. Recalling that
$w_{ij} = K(X_i, X_j)$, we note that g is composed of a sum of $2^m$
terms, each a product of m kernel functions. Therefore,

\[ g \le \frac{1}{\sigma}\, 2^m \|K\|_{\infty}^m = \frac{1}{\sigma}\, \frac{2^m}{(2\pi\sigma^2)^{md/2}} = \frac{C^m}{\sigma^{md+1}}. \tag{19} \]

In order to apply a concentration inequality for V, we first re-write it
in the form of a U-statistic by regrouping terms in the summation so
that repeated indices are removed, as given in [?]:

\[ V = \frac{1}{n_{(m+1)}} \sum_{(n, m+1)} g^*(X_{i_1}, X_{i_2}, \ldots, X_{i_{m+1}}), \tag{20} \]

where $\sum_{(n, m+1)}$ denotes summation over all (m+1)-tuples of distinct
indices taken from the set $\{1, \ldots, n\}$, $n_{(m+1)} = n(n-1)\cdots(n-m)$
is the number of (m+1)-permutations of n and $g^*$ is
a convex combination of certain values of g that absorbs repeating
indices, satisfying the property:

\[ g^*(x_1, x_2, \ldots, x_{m+1}) = \frac{n_{(m+1)}}{n^{m+1}}\, g(x_1, x_2, \ldots, x_{m+1}) + O\left( \frac{1}{n} \right) \text{(terms with repeated indices)}. \tag{21} \]

Therefore, $g^*$ has the same upper bound as that of g derived in (19).
Moreover, using the fact that $E\{V\} = E\{g^*\}$, we can bound the
variance of $g^*$ as

\[ \mathrm{Var}\{g^*\} \le E\{(g^*)^2\} \le \|g^*\|_{\infty}\, E\{g^*\} = \frac{C^m}{\sigma^{md+1}}\, E\{V\}. \tag{22} \]

Finally, plugging in the bound and variance of $g^*$ in Bernstein's
inequality for U-statistics [?], we arrive at the result of (16). □

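Lemma 2 (Convergence of bias). As $n \to \infty$, $\sigma \to 0$ and $m\sigma^2 \to 0$,
we have

\[ E\{V\} \longrightarrow \frac{t(m)}{\sqrt{2\pi}} \int_{\partial S} p^{m+1}(s)\,ds, \tag{23} \]

where $t(m) = \sum_{r=0}^{m-1} \binom{m-1}{r} (-1)^r \left( \sqrt{r+1} - \sqrt{r} \right)$.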

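Proof sketch. We use the following properties of $K_{\sigma^2}(x, y)$:

\[ \int K_{\sigma^2}(x, y)\, p(y)\,dy = p(x) + O(\sigma^2), \tag{24} \]

\[ \int K_{a\sigma^2}(x, z)\, K_{b\sigma^2}(z, y)\, p(z)\,dz = K_{(a+b)\sigma^2}(x, y) \left[ p\left( \frac{bx + ay}{a + b} \right) + O(\sigma^2) \right]. \tag{25} \]

We evaluate E{V} term by term by writing $L^m = \frac{1}{n^m}(D - W)^{m-1}(D - W)$.
For all terms in the expansion of $(D - W)^{m-1}$ containing r
occurrences of W, we use (24), (25) and $m\sigma^2 \to 0$ to get

\[ E\left\{ \frac{1}{n^{m+1}\sigma}\,\mathbf{1}_S^T \left[ D^{m-1-r} W^r \right] (D - W)\,\mathbf{1}_S \right\} = \frac{1}{\sigma} \int_S \int_S K_{r\sigma^2}(x, y)\, p^{\alpha}(x)\, p^{\beta}(y)\,dx\,dy - \frac{1}{\sigma} \int_S \int_S K_{(r+1)\sigma^2}(x, y)\, p^{\alpha'}(x)\, p^{\beta'}(y)\,dx\,dy + O(\sigma), \tag{26} \]

where $\alpha + \beta = m + 1$ and $\alpha' + \beta' = m + 1$. It can be shown that
the right hand side of (26) converges to $\int_{\partial S} p^{m+1}(s)\,ds$ up to a
constant factor proportional to $\sqrt{r+1} - \sqrt{r}$.

Putting everything together, we get the desired result. □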


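Finally, we note that as $m \to \infty$, we have

\[ \left( \frac{ t(m) \int_{\partial S} p^{m+1}(s)\,ds }{ \sqrt{2\pi} \int_S p(x)\,dx } \right)^{1/m} \longrightarrow \sup_{s \in \partial S} p(s). \tag{27} \]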
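The limit above is an instance of the fact that the m-th root of an integral of the (m+1)-th power of a function tends to its supremum, with the m-independent denominator and slowly varying prefactors washing out. A small numerical check, assuming a density sampled on a uniform one-dimensional grid along the boundary (the function name `power_integral_root` and the Riemann-sum discretization are illustrative choices):

```python
import numpy as np

def power_integral_root(p_vals, ds, m):
    """Approximate (integral of p^(m+1) ds)^(1/m) from samples on a uniform grid.

    The maximum is factored out before raising to a high power to avoid
    underflow for large m.
    """
    p_vals = np.asarray(p_vals, dtype=float)
    p_max = p_vals.max()
    integral = np.sum((p_vals / p_max) ** (m + 1)) * ds   # Riemann sum
    return p_max ** ((m + 1) / m) * integral ** (1.0 / m)

# Example: a standard normal profile along the boundary coordinate s.
s = np.linspace(-5.0, 5.0, 4001)
profile = np.exp(-s**2 / 2.0) / np.sqrt(2.0 * np.pi)
estimates = [power_integral_root(profile, s[1] - s[0], m) for m in (10, 100, 1000)]
```

As m grows, the entries of `estimates` move toward `profile.max()`, which illustrates why only the supremum of the density on the boundary survives in the limit.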


4. EXPERIMENTAL RESULTS 


In this section, we numerically analyze our asymptotic results and
show that they are also useful in practice. For our experiments,
we considered a 2-D Gaussian mixture model with three Gaussians:
$\mu_1 = [-2, 0]$, $\Sigma_1 = 0.64 I$, $\mu_2 = [0, 0]$, $\Sigma_2 = 0.25 I$ and
$\mu_3 = [2, 0]$, $\Sigma_3 = 0.16 I$, with corresponding mixture proportions
$\alpha_1 = 0.5$, $\alpha_2 = 0.2$, $\alpha_3 = 0.3$. The plot of the density is given in Figure 1.
For computing edge weights of the graph, we set $\sigma = 0.1$.
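The experimental setup can be reproduced approximately with the sketch below, which samples from the three-component mixture and evaluates the theoretical limit, the supremum of the density over a vertical boundary x = c. It relies on the observation that, since all means lie on the first axis and all covariances are isotropic, the density along x = c is largest at the point (c, 0). The function names and the random seed handling are assumptions, not the authors' code.

```python
import numpy as np

MEANS = np.array([[-2.0, 0.0], [0.0, 0.0], [2.0, 0.0]])
VARS = np.array([0.64, 0.25, 0.16])       # Sigma_k = VARS[k] * I
WEIGHTS = np.array([0.5, 0.2, 0.3])       # mixture proportions

def sample_gmm(n, rng):
    """Draw n i.i.d. points from the 2-D Gaussian mixture."""
    comp = rng.choice(3, size=n, p=WEIGHTS)
    noise = rng.standard_normal((n, 2))
    return MEANS[comp] + np.sqrt(VARS[comp])[:, None] * noise

def gmm_density(points):
    """Evaluate the mixture density at each row of points (shape (N, 2))."""
    points = np.asarray(points, dtype=float)
    diff2 = ((points[:, None, :] - MEANS[None, :, :]) ** 2).sum(axis=2)
    comps = np.exp(-diff2 / (2.0 * VARS)) / (2.0 * np.pi * VARS)
    return comps @ WEIGHTS

def sup_density_on_boundary(c):
    """Supremum of the density over the line x = c, attained at (c, 0)."""
    return gmm_density(np.array([[c, 0.0]]))[0]
```

Sweeping c through `sup_density_on_boundary` gives the reference curve against which empirical bandwidth estimates of the indicator of the half-plane on one side of x = c can be compared.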

In our first experiment, we studied the behavior of the empirical
bandwidth estimate $\omega_m(\mathbf{1}_S)$ with n for different values of m.
We used sample sizes varying from n = 500 to n = 2500, drawn
i.i.d. from the pdf, to compute $\omega_m(\mathbf{1}_S)$ with m = 10, 20, 30 for
the 2D hyperplane $\partial S$: x = 0. This experiment was repeated 100
times and the mean was compared with the supremum of p(x) over the
boundary (Figure 2). We observe that as m increases, the mean empirical
bandwidth estimate approaches the theoretical limit (for a fixed m,
the mean value decreases slightly with n since for a higher n, the
rate of convergence of $\omega_m(\mathbf{1}_S)$ with m is slower). Further, as n
increases, the standard deviation of the empirical bandwidth decreases,
indicating asymptotic convergence of the empirical quantity.

Next, we validate the result of Theorem 3 for different boundaries.
This is carried out as follows: we fix the bandwidth approximation
factor to m = 20 and compare $\omega_m(\mathbf{1}_S)$ with $\sup_{s \in \partial S} p(s)$
for different positions of the boundary $\partial S$: x = c (obtained by
sweeping c as shown in Figure 1). This procedure is carried out 100
times and the results are shown in Figure 3. We observe that the
empirical values are fairly close to the limit values given by the
supremum of p(x) over the boundary; the slight gap arises due to finite m. The
overshoot of the empirical quantity over the supremum for some
positions of the boundary happens because $\sigma$ is not small enough for
convergence of the bias term at those parameter settings.


Fig. 1: 2D GMM used in experiments. The family of hyperplanes x = c
that cut perpendicular to the first dimension (the "informative"
dimension for the pdf) are taken as decision boundaries $\partial S$.

Fig. 2: Convergence of $\omega_m(\mathbf{1}_S)$ with n for the boundary $\partial S$: x = 0
and different m (panels: m = 10, m = 20, m = 30). $\sigma$ is fixed at 0.1.
Shaded area indicates standard deviation over 100 experiments.
Red dashed line shows $\sup_{s \in \partial S} p(s)$.

Fig. 3: Convergence of $\omega_m(\mathbf{1}_S)$ with m = 20 for varying hyperplane
parameter c. n and $\sigma$ are fixed at 2500 and 0.1. Shaded area
indicates standard deviation over 100 experiments. Red dashed line
shows $\sup_{s \in \partial S} p(s)$.


5. SUMMARY 

In this paper, we provided an asymptotic justification of using the
bandlimited interpolation of graph signals (BIG) approach for
semi-supervised learning (SSL). We considered a statistical setting and
computed the limiting value of the bandwidth estimate for any indicator
signal defined on a distance-based similarity graph that is fairly
common in practice. As a consequence of our result and the sampling
theory for graph signals, the BIG approach for SSL is found to
be closely related to the low density separation problem. We showed
through experimental analysis that the theoretical results are useful
in practical scenarios. In future work, we aim to exploit this result
for finding the label complexity of any indicator signal in the "BIG
for SSL" framework, and comparing the BIG approach with existing
methods, to further understand the value of labeled data.